Capacity limits and how the visual system copes with them
Abstract
A visual system cannot process everything with full fidelity, nor, in a given moment, perform all possible visual tasks. Rather, it must lose some information and prioritize some tasks over others. The human visual system has developed a number of strategies for dealing with its limited capacity. This paper reviews recent evidence for one strategy: encoding the visual input in terms of a rich set of local image statistics, where the local regions grow (and the representation becomes less precise) with distance from fixation. The explanatory power of this proposed encoding scheme has implications for another proposed strategy for dealing with limited capacity: that of selective attention, which gates visual processing so that the visual system momentarily processes some objects, features, or locations at the expense of others. A lossy peripheral encoding offers an alternative explanation for a number of phenomena used to study selective attention. Based on lessons learned from studying peripheral vision, this paper proposes a different characterization of capacity limits as limits on decision complexity. A general-purpose decision process may deal with such limits by "cutting corners" when the task becomes too complicated.

Human vision is full of puzzles. Observers can grasp the essence of a scene in less than 100 ms, reporting with a fair degree of reliability whether it is a beach or a street, whether it contains any animals, and what materials are present [1, 2]. Yet when probed for details, they are at a loss. Change the scene while masking the motion transients, and the observer may have great difficulty determining what has changed, even when the change is quite visible once it has been spotted ("change blindness", [3, 4]). Human vision is better than the best computer vision systems ever created, yet it is also easily fooled by visual illusions. People can look at a line drawing of a 3-D object and effortlessly understand its shape, yet have difficulty noticing the impossibility of an Escher never-ending staircase. We have difficulty finding our keys, even when they prove quite visible once found and fixated, and look nothing like the other items on our desk.

How does one explain this combination of marvelous successes and quirky failures? It perhaps seems unsurprising that these diverse phenomena at present have no unifying explanation. What do they have in common? Certainly, scene perception, object recognition, and 3-D shape estimation require different mechanisms at some stage of visual processing. Nonetheless, might there exist a coherent explanation in terms of a critical stage of processing that determines performance for a wide variety of tasks, or at least a guiding principle for what tasks are easy and difficult?

Attempts to provide a unifying account have explained the failures in terms of the visual system having limited capacity (see [5] for a review). Our senses gather copious amounts of data, seemingly far more than our minds can fully process at once. At any given instant, we are consciously aware of only a small fraction of the incoming sensory input. We seem to have a limited capacity for awareness, for memory, and for the number of tasks we can simultaneously carry out, leading to poor performance at tasks that stress the capacity limits of the system.

A classic example of the limited-capacity logic concerns visual search. Suppose a researcher runs an experiment in which observers must find a target item among a number of other "distractor" items.
As in many such experiments, the experimenter picks a target and distractors such that the individual items seem easy to distinguish. Nonetheless, the researcher finds that search is inefficient, i.e., that it becomes significantly slower as one adds more distractors. Why is search difficult? One can easily discriminate the target from the distractors when looking directly at them. The poor search performance implies that vision is not the same everywhere, or, as Julian Hochberg put it, "vision is not everywhere dense" [6]. If vision were the same throughout the visual field, search would be easy.

By a popular account, the main reason vision is not the same everywhere has to do with attention, in particular selective attention. In this account, attention is a limited resource, and vision is better where the observer attends than where they do not. The visual system deals with limited capacity by serially shifting attention. Some tasks require selective attention, and as a result are subject to the performance limits inherent in having to wait for this limited resource. In the case of difficult search tasks, for instance, the target-distractor discrimination is presumed to require attention, making search significantly slower with increasing numbers of display items. Preattentive tasks, on the other hand, do not require attention; they can be performed quickly and in parallel, leading to easy search.

Selective attention is typically described as a mechanism that gates access to further visual processing [7, 8, 9] rather than engaging in processing itself. Once the visual system selects a portion of the visual input, perception happens. Throughout this paper, when I refer to selective attention, I mean such a gating mechanism.

Traditionally, researchers have taken visual search phenomena as evidence that selective attention operates early in the visual processing pipeline, and that correct binding of basic features into an object requires selective attention [10]. Though this account has had a certain amount of predictive power when it comes to visual search, it has been problematic overall [11, 12, 13, 14]. The need for selective attention to bind basic features seems to conflict with the relative ease of searching for a cube among differently lit cubes [15, 16, 17]; with the easy extraction of the gist of a scene [18, 19, 2, 20, 21, 22, 23] and of ensemble properties of sets [24, 25, 26]; and with which tasks require attention in a dual-task paradigm [27].

My lab has argued instead that a main way in which the visual system deals with limited capacity is through encoding its inputs in a way that favors foveal vision over peripheral vision. Peripheral vision is, as a rule, worse than foveal vision, and often much worse. Peripheral vision must condense a mass of information into a succinct representation that nonetheless carries the information needed for vision at a glance. Only a finite number of nerve fibers can emerge from the eye, and rather than providing uniformly mediocre vision, the eye trades sparse sampling in the periphery for sharp, high-resolution foveal vision. This economical design continues into the cortex: more cortical resources are devoted to processing central vision, at the expense of the periphery. We have proposed that the visual system deals with limited capacity in part by representing its input in terms of a rich set of local image statistics, where the local regions grow (and the representation becomes less precise) with distance from fixation [28].
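To make the geometry of such an encoding concrete, the sketch below tiles the visual field with pooling regions whose diameter grows linearly with eccentricity. It is a minimal illustration, not the published model: the growth factor follows Bouma's rule of thumb from the crowding literature (critical spacing of roughly half the eccentricity), and the starting eccentricity and absence of overlap are simplifications chosen for clarity.

```python
# Illustrative sketch (not the published TTM parameters): pooling regions
# whose diameter grows linearly with eccentricity, following Bouma's rule
# of thumb that crowding's critical spacing is roughly 0.5 x eccentricity.
BOUMA_FACTOR = 0.5  # estimates in the literature range from about 0.4 to 0.5

def pooling_diameter(eccentricity_deg: float) -> float:
    """Approximate pooling-region diameter (deg) at a given eccentricity."""
    return BOUMA_FACTOR * eccentricity_deg

def tile_visual_field(max_ecc_deg: float, first_ecc_deg: float = 1.0):
    """Lay regions end to end outward from fixation (real TTM regions overlap)."""
    regions = []
    ecc = first_ecc_deg
    while ecc < max_ecc_deg:
        diam = pooling_diameter(ecc)
        regions.append((ecc, diam))
        ecc += diam  # the next region starts where this one ends
    return regions

# Regions stay small near fixation and grow rapidly in the far periphery.
for ecc, diam in tile_visual_field(30.0):
    print(f"center at {ecc:5.2f} deg -> pooling diameter {diam:5.2f} deg")
```

Because region size compounds outward, even this modest linear rule covers the span from 1 to 30 degrees with only a handful of regions, which conveys how coarsely the far periphery is summarized relative to the fovea.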
Such a summary-statistic representation would render vision locally ambiguous in terms of the phase and location of features. This scheme thus trades spatial localization of features for the computation of sophisticated image features. One of the main implications of this theory for vision science has been the need to re-examine our understanding of visual attention. Most experiments investigating selective attention have had a peripheral vision confound, and a number of phenomena previously attributed to attention may instead arise in large part from peripheral encoding.

This paper begins by reviewing both phenomena in peripheral vision and our model of peripheral encoding. It reviews what we have learned about perception, as well as the implications for theories of attention, particularly selective attention. Our understanding of peripheral vision constrains possible additional mechanisms for dealing with limited capacity. In particular, I propose that the brain may face limits on decision complexity, and deal with those limits by performing a simpler version of any too-complex task, leading to poorer performance at the nominal task.

A lossy encoding in peripheral vision

Peripheral vision is susceptible to clutter, as evidenced by the phenomena of visual crowding. Classic crowding refers to the greater difficulty of identifying a peripheral target when it is flanked by neighboring stimuli than when it appears in isolation. Crowded stimuli may appear jumbled and uncertain, lacking crucial aspects of form, almost as if they have a textural or statistical nature [29]. Crowding has most often been studied with a target identification task, in which a target object is flanked by other objects, but it almost certainly affects perception more generally. Crowding points to significant qualitative differences between foveal and peripheral vision. These differences are far greater than the modest differences between foveal and peripheral acuity, and are likely task-relevant for a wide variety of tasks [30]. The phenomena of crowding have been described in detail in a number of recent review papers [31, 32, 33, 34].

My lab has argued that one must control for, or otherwise account for, the strengths and limitations of peripheral vision before considering explanations based upon visual attention [30, 14, 35]. Otherwise, one risks fundamental misunderstandings about both perception and attention. Whether the paradigm is visual search, change detection, dual-task, scene perception, or inattentional blindness (all tasks whose results have been interpreted in terms of the mechanisms of attention), the often-cluttered stimuli lie at least in part outside of the fovea, and are potentially subject to crowding.

A number of researchers have suggested that crowding results from "forced texture perception," in which information is pooled over sizeable portions of the visual field [29, 36, 31, 32]. Based on these intuitions, we have developed a candidate model of the peripheral encoding that we hypothesize underlies crowding. In this Texture Tiling Model (TTM), originally described in [28], the visual system computes a rich set of summary image statistics, pooled over regions that overlap and tile the visual field.
Because of the association with texture perception, we chose as our set of image statistics those of a state-of-the-art model of texture appearance [37]: the marginal distribution of luminance; luminance autocorrelation; correlations of the magnitude of responses of oriented V1-like wavelets across differences in orientation, neighboring positions, and scale; and phase correlation across scale. This seemingly complicated set of parameters is actually fairly intuitive: computing a given second-order correlation merely requires taking the responses of a pair of V1-like filters, point-wise multiplying them, and taking the average over the pooling region. This proposal [28, 38] is not so different from models of the hierarchical encoding for object recognition, in which later stages compute more complex features by measuring co-occurrences of features from the previous layer [39, 40, 41, 42]. Second-order correlations are essentially co-occurrences pooled over a substantially larger area.

This encoding scheme provides an efficient, compressed representation that captures a great deal of information about the visual input. Nonetheless, the encoding is lossy, meaning one cannot reconstruct the original image exactly. We hypothesize that the information maintained and lost by this encoding provides a significant constraint on peripheral processing, and constitutes an important and often task-relevant way in which vision is not the same across the visual field.

The proposed lossy encoding has potential implications for virtually all visual tasks. Simply re-examining possible confounds in selective attention studies requires the ability to apply a single model to recognition of crowded peripheral targets, visual search, scene perception, ensemble perception, and dual-task experiments. In order to make predictions for this wide range of stimuli and tasks, one needs a model applicable to arbitrary images.

In addition, critical to our understanding of this encoding scheme has been the use of texture synthesis methodologies for visualizing the equivalence classes of the model. Using these techniques, one can generate, for a given input and fixation, images with approximately the same summary statistics [28, 37, 43, 14, 38]. These visualizations allow for easy intuitions about the implications of the model. Figure 1B,C shows two examples synthesized from the image in Figure 1A. Information that is readily available in these synthesized images corresponds to information preserved by the encoding model.

Understanding a model through its equivalence classes is a relatively rare technique in human and computer vision (see [44, 37, 45] for a few notable exceptions). Visualizing the equivalence classes of TTM allows one to see immediately that many of the puzzles of human vision may arise from a single encoding mechanism [38, 28, 43, 14]. Doing so has suggested new experiments and predicted unexpected phenomena [28, 46]. On the other hand, getting intuitions about a low- to mid-level model by viewing the model outputs is fairly common. Researchers will filter an image to mimic a modeled contrast sensitivity function (CSF) and judge whether the CSF can predict phenomenology (e.g., [47]); they will apply a center-surround filter and judge whether that can predict lightness illusions (e.g., [48]); they will look at a model's predictions for perceived groups and judge whether they match known perceptual organization phenomena (e.g., [49]).
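As a concrete illustration of the filter-pair intuition described above, here is a minimal sketch of one second-order correlation computed within a single pooling region. It is not the published TTM implementation: the model of [37] correlates magnitudes of complex steerable-pyramid responses across orientation, position, and scale, whereas this toy version uses two real-valued Gabor filters, and the filter parameters and square pooling region are arbitrary choices for illustration.

```python
import numpy as np
from scipy.signal import fftconvolve

def gabor(size: int, wavelength: float, theta: float) -> np.ndarray:
    """An odd-phase Gabor filter, standing in for a V1-like oriented wavelet."""
    half = size // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1]
    x_rot = x * np.cos(theta) + y * np.sin(theta)
    envelope = np.exp(-(x**2 + y**2) / (2.0 * (size / 4.0) ** 2))
    return envelope * np.sin(2.0 * np.pi * x_rot / wavelength)

def second_order_correlation(image, filt_a, filt_b, pool_mask):
    """Respond with two filters, point-wise multiply, average over the region."""
    resp_a = fftconvolve(image, filt_a, mode="same")
    resp_b = fftconvolve(image, filt_b, mode="same")
    product = resp_a * resp_b            # point-wise multiplication
    return product[pool_mask].mean()     # average over the pooling region

# Example: correlation of vertical and oblique structure in a noise patch.
rng = np.random.default_rng(0)
image = rng.standard_normal((128, 128))
pool = np.zeros(image.shape, dtype=bool)
pool[32:96, 32:96] = True                # one square pooling region
vertical = gabor(15, wavelength=8.0, theta=0.0)
oblique = gabor(15, wavelength=8.0, theta=np.pi / 4)
print(second_order_correlation(image, vertical, oblique, pool))
```

In the full statistic set, many such pairwise products, taken across orientations, neighboring positions, and scales, are averaged within each of the eccentricity-scaled pooling regions sketched earlier.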
Furthermore, visualization of the equivalence classes has facilitated the generation of testable model predictions, allowing us to study the effects of this relatively low-level encoding on a wide range of higher-level tasks. Observers view the synthesized images and perform essentially the original task, whether that be object recognition, scene perception, or some other task [38, 28, 50, 51, 35, 52, 53, 54]. This allows one to determine how inherently easy or difficult each task is, given the information lost and maintained by the proposed encoding. The next section reviews evidence that the proposed encoding can qualitatively, and in many cases quantitatively, predict a range of visual perception phenomena.

Limitations of peripheral vision: A factor in many phenomena, and well modeled by TTM

For the last decade, we have worked to re-examine a number of visual phenomena to determine whether peripheral vision was a factor, and whether the encoding modeled by TTM can predict behavioral performance. This includes peripheral object recognition and some of the phenomena associated with the study of visual attention: visual search, scene perception, change blindness, and dual-task performance.

We have shown that TTM quantitatively predicts performance at a range of peripheral recognition tasks. Balas et al. [28] showed that a local encoding in terms of the hypothesized image statistics can predict identification of a peripheral letter flanked by similar letters, dissimilar letters, bars, curves, and photos of real-world objects. Rosenholtz et al. [35] and Zhang et al. [52] further demonstrated that this model could predict identification of crowded symbols derived from visual search stimuli. More recently, Keshvari and Rosenholtz [51] have used the same model to explain the results of three sets of crowding experiments, involving letter identification tasks [55], classification of the orientation and position of a crossbar on T-like stimuli [56], and identification of the orientation, color, and spatial frequency of crowded Gabors [57]. In all of these cases, we made predictions based on the information encoded in a single pooling region that included both target and flankers within the critical spacing of crowding. Figure 3A plots some of these results. Note that there are no free parameters in the fit of the model to the data; model predictions do not merely correlate with behavioral results, but rather quantitatively predict the data. By incorporating information from multiple, overlapping pooling regions, Freeman and Simoncelli [38] showed that they could predict the critical spacing of crowding for letter triplets. Their pooling region sizes and arrangement were set to make it difficult to distinguish between two synthesized images with the same local statistics.

Peripheral discriminability of target-present from target-absent patches predicts the difficulty of search for a T among Ls, an O among Qs, a Q among Os, a tilted line among vertical, and conjunction search [35]. The same is true for search conditions that pose difficulties for selective attention models: cube search vs. search for similar polygonal patterns without a 3-D interpretation [52]. Differences between foveal and peripheral vision are thus task-relevant for visual search. TTM, in turn, predicts the difficulty of these peripheral discrimination tasks, and thus of search (Figure 3B). There is some evidence from search experiments that the model requires additional or different features (e.g., worse encoding of oblique compared to horizontal or vertical lines, and more correlations between different orientations across space); running TTM has given us some intuitions about how to improve the model. More recently, with the model in hand, we subtly changed classic search displays in ways that should not affect predictions according to traditional selective-attention-for-binding explanations. We changed stroke width, stroke length, or the set of distractors, and correctly predicted whether these changes would make search easier or more difficult [46].

A primary difficulty with early selection accounts has been the ease with which observers can perform many scene tasks. The attentional mechanism that supposedly underlies visual search difficulty has seemed incompatible with the ease with which observers can get the gist of a scene. TTM gives us, perhaps for the first time, a mechanism that can explain both difficult search and easy scene perception. We asked observers to perform a number of navigation-related and other naturalistic scene tasks both at a glance (while fixating the center of the image) and free-viewing. Figure 3C shows predictions of TTM vs. performance at a glance [50]. The prediction is quite good. This graph exaggerates the power of the model of peripheral vision, however, as some tasks are inherently difficult even when free-viewing. Figure 3D compares instead how much more difficult each task is when fixating than when free-viewing. We see that TTM also does a reasonable job of predicting which tasks are harder when one cannot move one's eyes, i.e., when forced to use extrafoveal vision. Although these are quantitative predictions of scene perception performance, the fit is not parameter-free; in modeling the interaction between multiple pooling regions, we chose particular amounts of overlap and density of pooling regions.

Freeman and Simoncelli [38] similarly modeled computation of these image statistics over multiple pooling regions. They adjusted the size of the pooling regions until observers could not tell apart two synthesized images with the same local encoding. They demonstrated that observers have trouble distinguishing between the synthesized "metamers", even when attending to regions with large differences. One can reinterpret this result as showing that such a model can predict failures to notice differences between images, even under focused attention.

Figure 1. A. Original scene image. B, C. According to the Texture Tiling Model, these images are members of the equivalence class of (A). Details that appear clear in these visualizations are those predicted to be well encoded by the model. TTM preserves the information necessary to easily tell that this is a street scene, possibly a bus stop, with cars on the road, people standing in the foreground, and a building and trees in the background. However, given this encoding, an observer may not be certain of the details, such as the number and types of vehicles, or the number of people.
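Visualizations like those in Figure 1B,C come from texture synthesis: starting from noise, one iteratively adjusts the image until its pooled statistics approximately match those of the original. The toy sketch below conveys only this match-statistics-by-iteration idea; it matches a deliberately impoverished statistic set (global mean and contrast plus power in a few orientation bands, with no eccentricity-dependent pooling), unlike the full statistic set that TTM matches via the iterative procedure of [37].

```python
import numpy as np

def orientation_mask(shape, theta, halfwidth=np.pi / 8):
    """Frequency-domain mask passing components oriented near theta."""
    fy = np.fft.fftfreq(shape[0])[:, None]
    fx = np.fft.fftfreq(shape[1])[None, :]
    angle = np.arctan2(fy, fx)
    # orientation distance, folded into [0, pi/2] (orientation is mod pi)
    d = np.abs((angle - theta + np.pi / 2) % np.pi - np.pi / 2)
    return (d < halfwidth).astype(float)

def band_power(spectrum, mask):
    """Total power in the masked orientation band."""
    return float(np.sum(np.abs(spectrum) ** 2 * mask))

def synthesize(target, n_orientations=4, iterations=25, seed=0):
    """Iteratively push a noise image toward the target's summary statistics."""
    rng = np.random.default_rng(seed)
    synth = rng.standard_normal(target.shape)
    masks = [orientation_mask(target.shape, k * np.pi / n_orientations)
             for k in range(n_orientations)]
    target_spectrum = np.fft.fft2(target)
    target_powers = [band_power(target_spectrum, m) for m in masks]
    for _ in range(iterations):
        spectrum = np.fft.fft2(synth)
        for mask, t_pow in zip(masks, target_powers):
            s_pow = band_power(spectrum, mask)
            gain = np.sqrt(t_pow / max(s_pow, 1e-12))
            spectrum = spectrum * (1.0 + (gain - 1.0) * mask)  # rescale band
        synth = np.real(np.fft.ifft2(spectrum))
        # re-impose the target's global mean and contrast
        synth = (synth - synth.mean()) / (synth.std() + 1e-12)
        synth = synth * target.std() + target.mean()
    return synth

# Example: the result shares (approximate) statistics with the target,
# yet its pixels differ, i.e., it is another member of the equivalence class.
rng = np.random.default_rng(1)
target = rng.standard_normal((64, 64))
result = synthesize(target)
print(float(np.abs(result - target).mean()))  # typically far from zero
```

Different noise seeds yield different members of the same approximate equivalence class, which is how multiple visualizations such as Figure 1B and 1C can arise from a single original image.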